AITopics | harmful instruction

Collaborating Authors

harmful instruction

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs

Yang, Xikang, Zhou, Biyu, Tang, Xuehai, Han, Jizhong, Hu, Songlin

arXiv.org Artificial IntelligenceNov-18-2025

Large Language Models (LLMs) demonstrate impressive capabilities across a wide range of tasks, yet their safety mechanisms remain susceptible to adversarial attacks that exploit cognitive biases -- systematic deviations from rational judgment. Unlike prior jailbreaking approaches focused on prompt engineering or algorithmic manipulation, this work highlights the overlooked power of multi-bias interactions in undermining LLM safeguards. We propose CognitiveAttack, a novel red-teaming framework that systematically leverages both individual and combined cognitive biases. By integrating supervised fine-tuning and reinforcement learning, CognitiveAttack generates prompts that embed optimized bias combinations, effectively bypassing safety protocols while maintaining high attack success rates. Experimental results reveal significant vulnerabilities across 30 diverse LLMs, particularly in open-source models. CognitiveAttack achieves a substantially higher attack success rate compared to the SOTA black-box method PAP (60.1% vs. 31.6%), exposing critical limitations in current defense mechanisms. These findings highlight multi-bias interactions as a powerful yet underexplored attack vector. This work introduces a novel interdisciplinary perspective by bridging cognitive science and LLM safety, paving the way for more robust and human-aligned AI systems.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2507.22564

Country: Europe > Austria (0.28)

Genre: Research Report > New Finding (0.93)

Industry:

Information Technology > Security & Privacy (1.00)
Government (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)

Add feedback

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Yang, Yudong, Zhang, Xuezhen, Han, Zhifeng, Wang, Siyin, Zhuang, Jimin, Jin, Zengrui, Shao, Jing, Sun, Guangzhi, Zhang, Chao

arXiv.org Artificial IntelligenceNov-17-2025

Recent progress in large language models (LLMs) has enabled understanding of both speech and non-speech audio, but exposing new safety risks emerging from complex audio inputs that are inadequately handled by current safeguards. We introduce SACRED-Bench (Speech-Audio Composition for RED-teaming) to evaluate the robustness of LLMs under complex audio-based attacks. Unlike existing perturbation-based methods that rely on noise optimization or white-box access, SACRED-Bench exploits speech-audio composition mechanisms. SACRED-Bench adopts three mechanisms: (a) speech overlap and multi-speaker dialogue, which embeds harmful prompts beneath or alongside benign speech; (b) speech-audio mixture, which imply unsafe intent via non-speech audio alongside benign speech or audio; and (c) diverse spoken instruction formats (open-ended QA, yes/no) that evade text-only filters. Experiments show that, even Gemini 2.5 Pro, the state-of-the-art proprietary LLM, still exhibits 66% attack success rate in SACRED-Bench test set, exposing vulnerabilities under cross-modal, speech-audio composition attacks. To bridge this gap, we propose SALMONN-Guard, a safeguard LLM that jointly inspects speech, audio, and text for safety judgments, reducing attack success down to 20%. Our results highlight the need for audio-aware defenses for the safety of multimodal LLMs. The benchmark and SALMONN-Guard checkpoints can be found at https://huggingface.co/datasets/tsinghua-ee/SACRED-Bench. Warning: this paper includes examples that may be offensive or harmful.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2511.10222

Genre: Research Report > New Finding (0.66)

Industry:

Information Technology > Security & Privacy (0.46)
Government > Military (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback

Chain-of-Thought Hijacking

Zhao, Jianli, Fu, Tingchen, Schaeffer, Rylan, Sharma, Mrinank, Barez, Fazl

arXiv.org Artificial IntelligenceNov-12-2025

Large reasoning models (LRMs) achieve higher task performance with more inference-time computation, and prior works suggest this scaled reasoning may also strengthen safety by improving refusal. Yet we find the opposite: the same reasoning can be used to bypass safeguards. We introduce Chain-of-Thought Hijacking, a jailbreak attack on reasoning models. The attack pads harmful requests with long sequences of harmless puzzle reasoning. Across HarmBench, CoT Hijacking reaches a 99%, 94%, 100%, and 94% attack success rate (ASR) on Gemini 2.5 Pro, GPT o4 mini, Grok 3 mini, and Claude 4 Sonnet, respectively - far exceeding prior jailbreak methods for LRMs. To understand the effectiveness of our attack, we turn to a mechanistic analysis, which shows that mid layers encode the strength of safety checking, while late layers encode the verification outcome. Long benign CoT dilutes both signals by shifting attention away from harmful tokens. Targeted ablations of attention heads identified by this analysis causally decrease refusal, confirming their role in a safety subnetwork. These results show that the most interpretable form of reasoning - explicit CoT - can itself become a jailbreak vector when combined with final-answer cues. We release prompts, outputs, and judge decisions to facilitate replication.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2510.26418

Country:

North America > United States > California (0.28)
Europe > United Kingdom (0.28)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Law Enforcement & Public Safety > Terrorism (0.94)
Law (0.89)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test

Yang, Guan-Yan, Cheng, Tzu-Yu, Teng, Ya-Wen, Wanga, Farn, Yeh, Kuo-Hui

arXiv.org Artificial IntelligenceOct-28-2025

The integration of Large Language Models (LLMs) into computer applications has introduced transformative capabilities but also significant security challenges. Existing safety alignments, which primarily focus on semantic interpretation, leave LLMs vulnerable to attacks that use non-standard data representations. This paper introduces ArtPerception, a novel black-box jailbreak framework that strategically leverages ASCII art to bypass the security measures of state-of-the-art (SOTA) LLMs. Unlike prior methods that rely on iterative, brute-force attacks, ArtPerception introduces a systematic, two-phase methodology. Phase 1 conducts a one-time, model-specific pre-test to empirically determine the optimal parameters for ASCII art recognition. Phase 2 leverages these insights to launch a highly efficient, one-shot malicious jailbreak attack. We propose a Modified Levenshtein Distance (MLD) metric for a more nuanced evaluation of an LLM's recognition capability. Through comprehensive experiments on four SOTA open-source LLMs, we demonstrate superior jailbreak performance. We further validate our framework's real-world relevance by showing its successful transferability to leading commercial models, including GPT-4o, Claude Sonnet 3.7, and DeepSeek-V3, and by conducting a rigorous effectiveness analysis against potential defenses such as LLaMA Guard and Azure's content filters. Our findings underscore that true LLM security requires defending against a multi-modal space of interpretations, even within text-only inputs, and highlight the effectiveness of strategic, reconnaissance-based attacks. Content Warning: This paper includes potentially harmful and offensive model outputs.

artperception, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

doi: 10.1016/j.jnca.2025.104356

2510.10281

Country: North America > United States (0.93)

Genre: Research Report > New Finding (1.00)

Industry:

Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

HauntAttack: When Attack Follows Reasoning as a Shadow

Ma, Jingyuan, Li, Rui, Li, Zheng, Liu, Junfeng, Xia, Heming, Sha, Lei, Sui, Zhifang

arXiv.org Artificial IntelligenceOct-24-2025

Emerging Large Reasoning Models (LRMs) consistently excel in mathematical and reasoning tasks, showcasing remarkable capabilities. However, the enhancement of reasoning abilities and the exposure of internal reasoning processes introduce new safety vulnerabilities. A critical question arises: when reasoning becomes intertwined with harmfulness, will LRMs become more vulnerable to jailbreaks in reasoning mode? To investigate this, we introduce HauntAttack, a novel and general-purpose black-box adversarial attack framework that systematically embeds harmful instructions into reasoning questions. Specifically, we modify key reasoning conditions in existing questions with harmful instructions, thereby constructing a reasoning pathway that guides the model step by step toward unsafe outputs. We evaluate HauntAttack on 11 LRMs and observe an average attack success rate of 70\%, achieving up to 12 percentage points of absolute improvement over the strongest prior baseline. Our further analysis reveals that even advanced safety-aligned models remain highly susceptible to reasoning-based attacks, offering insights into the urgent challenge of balancing reasoning capability and safety in future model development.

information, large language model, machine learning, (22 more...)

arXiv.org Artificial Intelligence

2506.07031

Country:

Europe > Austria (0.28)
North America > United States (0.28)

Genre:

Research Report > New Finding (1.00)
Overview (0.93)

Industry:

Law Enforcement & Public Safety (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine (1.00)
Government > Military (0.88)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(3 more...)

Add feedback

VisualDAN: Exposing Vulnerabilities in VLMs with Visual-Driven DAN Commands

Liu, Aofan, Tang, Lulu

arXiv.org Artificial IntelligenceOct-14-2025

Vision-Language Models (VLMs) have garnered significant attention for their remarkable ability to interpret and generate multimodal content. However, securing these models against jailbreak attacks continues to be a substantial challenge. Unlike text-only models, VLMs integrate additional modalities, introducing novel vulnerabilities such as image hijacking, which can manipulate the model into producing inappropriate or harmful responses. Drawing inspiration from text-based jailbreaks like the "Do Anything Now" (DAN) command, this work introduces VisualDAN, a single adversarial image embedded with DAN-style commands. Specifically, we prepend harmful corpora with affirmative prefixes (e.g., "Sure, I can provide the guidance you need") to trick the model into responding positively to malicious queries. The adversarial image is then trained on these DAN-inspired harmful texts and transformed into the text domain to elicit malicious outputs. Extensive experiments on models such as MiniGPT-4, MiniGPT-v2, InstructBLIP, and LLaVA reveal that VisualDAN effectively bypasses the safeguards of aligned VLMs, forcing them to execute a broad range of harmful instructions that severely violate ethical standards. Our results further demonstrate that even a small amount of toxic content can significantly amplify harmful outputs once the model's defenses are compromised. These findings highlight the urgent need for robust defenses against image-based attacks and offer critical insights for future research into the alignment and security of VLMs.

arxiv preprint arxiv, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.09699

Genre: Research Report > New Finding (1.00)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Defending MoE LLMs against Harmful Fine-Tuning via Safety Routing Alignment

Kim, Jaehan, Song, Minkyoo, Shin, Seungwon, Son, Sooel

arXiv.org Artificial IntelligenceOct-10-2025

Recent large language models (LLMs) have increasingly adopted the Mixture-of-Experts (MoE) architecture for efficiency. MoE-based LLMs heavily depend on a superficial safety mechanism in which harmful inputs are routed safety-critical experts. However, our analysis reveals that routing decisions for harmful inputs drift significantly after fine-tuning, exposing a critical vulnerability to harmful fine-tuning (HFT) attacks. Existing defenses, primarily designed for monolithic LLMs, are less effective for MoE LLMs as they fail to prevent drift in harmful input routing. To address this limitation, we propose SafeMoE, a safe fine-tuning method tailored to MoE LLMs. SafeMoE directly mitigates routing drift by penalizing the gap between the routing weights of a fine-tuned model and those of the initial safety-aligned model, thereby preserving the safety-aligned routing of harmful inputs to safety-critical experts. Experiments on open-source MoE LLMs ranging from 7B to 141B parameters demonstrate that SafeMoE effectively mitigates HFT attacks, reducing the harmfulness score of OLMoE from 62.0 to 5.0, for example, while maintaining task utility within 1% degradation and incurring only 2% overhead. It significantly outperforms state-of-the-art defense methods for safeguarding LLM fine-tuning and remains effective in recent large-scale MoE LLMs such as gpt-oss and Llama 4. Our implementation is available at https://anonymous.4open.science/r/SafeMoE.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2509.22745

Genre: Research Report > New Finding (0.93)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study

Lee, DongGeon, Jang, Joonwon, Jeong, Jihae, Yu, Hwanjo

arXiv.org Artificial IntelligenceSep-24-2025

Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2505.15389

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)

Genre: Research Report > New Finding (1.00)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

ENJ: Optimizing Noise with Genetic Algorithms to Jailbreak LSMs

Zhang, Yibo, Lin, Liang

arXiv.org Artificial IntelligenceSep-16-2025

These samples sound like harmless noise to humans but can induce the model to parse and execute harmful commands. Extensive experiments on multiple mainstream speech models show that ENJ's attack effectiveness is significantly superior to existing baseline methods. This research reveals the dual role of noise in speech security and provides new critical insights for model security defense in complex acoustic environments. Index T erms-- Large Speech Model, Jailbreak Attack, Genetic Algorithm, Environmental Noise 1. INTRODUCTION Driven by deep learning and large-scale data, Large-scale Speech Models (LSMs) have made remarkable progress, profoundly changing the way of human - computer interaction. As these models become increasingly capable and widely used in voice control systems, the security risks they expose are in urgent need of in-depth examination [1, 2]. Different from text-based models, LSMs essentially process information transmitted in audible audio signals, which creates a unique attack surface. Among them, "Jailbreaking" is a key threat [3, 4]. In jailbreaking attacks, attackers aim to construct specific inputs to induce the model to bypass its built - in security protection mechanisms and execute harmful instructions while keeping the output semantically understandable [5].

evolutionary algorithm, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2509.11128

Country: North America > United States > New Mexico (0.14)

Genre: Research Report (0.50)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Evolutionary Systems (0.85)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)

Add feedback

Mitigating Jailbreaks with Intent-Aware LLMs

Yeo, Wei Jie, Satapathy, Ranjan, Cambria, Erik

arXiv.org Artificial IntelligenceAug-26-2025

Despite extensive safety-tuning, large language models (LLMs) remain vulnerable to jailbreak attacks via adversarially crafted instructions, reflecting a persistent trade-off between safety and task performance. We comprehensively evaluate both parametric and non-parametric attacks across open-source and proprietary models, considering harmfulness from attacks, utility, over-refusal, and impact against white-box threats. Importantly, our method preserves the model's general capabilities and reduces excessive refusals on benign instructions containing superficially harmful keywords. With the rapid advancement of large language models (LLMs) (Grattafiori et al., 2024; Y ang et al., 2025; Liu et al., 2024; Mao et al., 2024), the risk of these models executing harmful or catastrophic instructions has grown correspondingly (Anthropic, 2025). This is largely managed by efforts such as dedicated a safety-alignment stage (Ouyang et al., 2022), aiming to ensure that LLMs are not only helpful but also consistently generate safe and ethical outputs. Nevertheless, recent findings by Qi et al. (2024) expose a fundamental vulnerability in prevailing safety-alignment practices: Shallow Alignment . In particular, alignment in most models is largely superficial--constrained to surface-level refusals--resulting in safe outputs that are often limited to generic templates such as "I am sorry but... " or "As a language model... " . This superficial alignment permits attackers to circumvent safety mechanisms by explicitly instructing the model to avoid generating commonly recognized refusal responses (Tang, 2024; Andriushchenko et al., 2025). Furthermore, LLMs remain susceptible to a broader range of prompt-based attacks, including those that optimize over discrete suffix tokens (Zou et al., 2023; Basani & Zhang, 2025) or rephrase harmful instructions to look harmless (Chao et al., 2025; Zeng et al., 2024). Beyond initial safety alignment, practitioners have developed a range of inference-time defenses, such as prompting models to adhere to their safety guidelines (Xie et al., 2023) incorporating additional safety exemplars to enable in-context defense (Wei et al., 2023). Wang et al. (2024) introduce a backdoor trigger into safety-aligned LLMs, serving as a covert prefix that elicits safety responses when detected, without affecting model behavior on benign queries. In recent works, Zhang et al. (2024) introduced a dual-stage prompting strategy, Intention Analysis (IA), which encourages LLMs to analyze the intent behind an instruction prior to generating a safe response.

large language model, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.12072

Genre: Research Report (1.00)

Industry:

Government > Military (0.69)
Information Technology > Security & Privacy (0.47)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback